COMPLETE ROADMAP: Building Text-to-Speech (TTS) & Speech-to-Text (STT) Models & Services
From Scratch to Production – Beginner → Advanced → Research Level
1. FOUNDATION PREREQUISITES
1.1 Mathematics
- Linear Algebra: Vectors, matrices, dot products, SVD, eigenvalues
- Why: All neural networks are matrix operations
- Calculus: Derivatives, gradients, chain rule, partial derivatives
- Why: Backpropagation relies on chain rule
- Probability & Statistics: Gaussian distributions, Bayesian inference, MLE, MAP
- Why: Acoustic models are probabilistic; language models use probability
- Signal Processing Mathematics
- Fourier Transform (DFT, FFT)
- Convolution theorem
- Z-Transform
- Nyquist-Shannon sampling theorem
- Windowing functions (Hamming, Hann, Blackman)
1.2 Programming Languages
- Python (primary – used in ~90% of ML/audio research)
- NumPy, SciPy, Matplotlib
- OOP, functional patterns, async programming
- C++ (for low-latency inference engines)
- JavaScript/TypeScript (for web APIs and browser-based STT/TTS)
- Shell/Bash (for pipeline automation, data processing)
1.3 Deep Learning Foundations
- Forward & backward propagation
- Activation functions (ReLU, GELU, Sigmoid, Softmax)
- Optimizers (SGD, Adam, AdamW, Lion)
- Regularization (Dropout, BatchNorm, LayerNorm, Weight Decay)
- Loss functions (Cross-entropy, CTC, MSE, L1)
- Sequence modeling fundamentals
1.4 Audio/Signal Processing Basics
- What is sound? Pressure waves, frequency, amplitude
- Sample rate (8kHz, 16kHz, 22.05kHz, 44.1kHz, 48kHz)
- Bit depth (8-bit, 16-bit, 32-bit float)
- Mono vs stereo
- Audio file formats: WAV, MP3, FLAC, OGG, OPUS
- Waveform representation
- Time domain vs frequency domain
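A minimal sketch tying these basics together, assuming soundfile and numpy are installed and a hypothetical speech.wav file exists: load a file, inspect its sample rate, bit depth, and channel count, then down-mix and normalize it.
import numpy as np
import soundfile as sf

# Load audio as float samples plus its sample rate
audio, sample_rate = sf.read("speech.wav")     # hypothetical input file

info = sf.info("speech.wav")
print(f"Sample rate: {sample_rate} Hz")        # e.g. 16000, 44100
print(f"Subtype (bit depth): {info.subtype}")  # e.g. PCM_16
print(f"Channels: {info.channels}")            # 1 = mono, 2 = stereo
print(f"Duration: {len(audio) / sample_rate:.2f} s")

# Down-mix stereo to mono by averaging channels
if audio.ndim == 2:
    audio = audio.mean(axis=1)

# Peak-normalize to [-1, 1]
audio = audio / np.max(np.abs(audio))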
2. CORE CONCEPTS & WORKING PRINCIPLES
2.1 How Human Speech Works
Lungs → Air pressure → Vocal cords vibrate → Resonates in vocal tract →
Articulators shape sound (tongue, lips, teeth) → Acoustic wave → Air → Ear
- Phonemes: Smallest units of sound (~44 in English)
- Prosody: Rhythm, stress, intonation, tempo
- Coarticulation: Phonemes influence neighboring sounds
- Formants: Resonant frequencies of the vocal tract (F1, F2, F3...)
2.2 Speech-to-Text (STT) β Working Principle
Audio Input → Pre-processing → Feature Extraction → Acoustic Model →
Language Model → Decoder → Text Output
Step-by-step:
- Microphone captures pressure variations → digital signal (waveform)
- Pre-process: remove noise, normalize, apply VAD (Voice Activity Detection)
- Extract features: convert raw audio to MFCCs, Mel Spectrograms, or raw waveform
- Acoustic model: predict phoneme/subword probabilities at each timestep
- Language model: rescore sequences based on linguistic probability
- Decoder: find most likely word sequence (Viterbi, Beam Search, CTC Greedy)
- Post-processing: punctuation restoration, capitalization, speaker labeling
2.3 Text-to-Speech (TTS) β Working Principle
Text Input → Text Analysis → Linguistic Features → Acoustic Model →
Vocoder → Audio Waveform Output
Step-by-step:
- Input text normalization (numbers → words, abbreviations → full form)
- G2P (Grapheme-to-Phoneme): convert letters to phonemes
- Prosody prediction: duration, pitch, energy per phoneme
- Acoustic model: generate mel spectrogram from linguistic features
- Vocoder: convert mel spectrogram to raw audio waveform
- Post-processing: audio normalization, format encoding
2.4 Key Audio Representations
| Representation | Description | Used In |
|---|---|---|
| Raw Waveform | Time-domain amplitude samples | WaveNet, WaveGlow, Encodec |
| STFT Spectrogram | Frequency vs time (complex) | Analysis, source separation |
| Mel Spectrogram | Perceptually-scaled frequency | Tacotron, Whisper, FastSpeech |
| MFCC | Compressed mel cepstral coefficients | Traditional ASR, GMM-HMM |
| Log-Mel | Log of mel spectrogram | Whisper, wav2vec 2.0 |
| Codec Tokens | Discrete audio tokens | EnCodec, SoundStream, VALL-E |
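A small sketch, assuming librosa is installed and a hypothetical speech.wav exists, that computes the continuous representations from the table above so their shapes can be compared:
import numpy as np
import librosa

# Raw waveform (time domain), resampled to 16 kHz mono
y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
print("waveform:", y.shape)

# STFT spectrogram (complex, frequency x time)
stft = librosa.stft(y, n_fft=512, hop_length=160)
print("STFT:", stft.shape)                     # (257, T)

# Mel spectrogram (perceptually scaled) and its log
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=80)
log_mel = np.log(mel + 1e-9)
print("log-mel:", log_mel.shape)               # (80, T)

# MFCCs (compressed cepstral coefficients)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print("MFCC:", mfcc.shape)                     # (13, T)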
3. STRUCTURED LEARNING PATH
PHASE 0: Signal Processing Foundations (4–6 weeks)
- Topic 1: Digital Audio Fundamentals
- Sampling and quantization
- Aliasing and anti-aliasing filters
- PCM encoding
- Practice: Load WAV files, plot waveforms with librosa/scipy
- Topic 2: Fourier Analysis
- Discrete Fourier Transform (DFT)
- Fast Fourier Transform (FFT) – Cooley-Tukey algorithm
- Short-Time Fourier Transform (STFT)
- Window size (frame length), hop size, overlap
- Griffin-Lim reconstruction algorithm
- Practice: Compute STFT, plot spectrograms, reconstruct audio
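One way to do the Topic 2 practice exercise, as a sketch (librosa and soundfile assumed installed, filename hypothetical): compute an STFT, throw away the phase, and recover a waveform with Griffin-Lim.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=22050)   # hypothetical input file

# STFT with a Hann window: frame length 1024 (~46 ms), hop 256 (75% overlap)
stft = librosa.stft(y, n_fft=1024, hop_length=256, window="hann")
magnitude = np.abs(stft)                        # discard phase

# Griffin-Lim: iteratively estimate a phase consistent with this magnitude
y_rec = librosa.griffinlim(magnitude, n_iter=60, hop_length=256, n_fft=1024)

sf.write("reconstructed.wav", y_rec, sr)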
- Topic 3: Mel Scale & Perceptual Features
- Mel filter banks (triangular filters on Mel scale)
- MFCC computation pipeline:
- Pre-emphasis filter
- Framing + windowing
- FFT
- Mel filter bank
- Log compression
- DCT (Discrete Cosine Transform)
- Delta and delta-delta features
- Practice: Implement MFCC from scratch without librosa
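A condensed sketch of that practice item: MFCC computed from scratch with NumPy/SciPy only. Frame, hop, FFT, and filter counts below are common but illustrative defaults.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, sr, f_min=0, f_max=None):
    # Triangular filters evenly spaced on the mel scale
    f_max = f_max or sr / 2
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_mfcc=13):
    # 1. Pre-emphasis
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Mel filterbank + log compression
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # 5. DCT, keep the first n_mfcc coefficients
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
Comparing the output against librosa.feature.mfcc on the same signal is a good sanity check (values will differ slightly due to different defaults).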
- Topic 4: Audio Pre-processing
- Noise reduction (spectral subtraction, Wiener filter)
- Voice Activity Detection (VAD) – energy-based, WebRTC VAD, Silero VAD (energy-based sketch below)
- Audio normalization (peak, RMS, LUFS)
- Resampling (polyphase filters)
- Practice: Build an audio pre-processing pipeline
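A rough sketch of the simplest VAD mentioned above, frame-wise RMS energy against a threshold; the threshold and frame size are illustrative assumptions, and production systems normally use WebRTC or Silero VAD instead.
import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Return one boolean per frame: True if the frame is likely speech."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    flags = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        rms_db = 20 * np.log10(rms + 1e-12)
        flags.append(rms_db > threshold_db)
    return np.array(flags)

# Illustrative usage: keep only frames flagged as speech
# speech_flags = energy_vad(audio)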
PHASE 1: Classical Speech Processing (3–4 weeks)
- Topic 5: Hidden Markov Models (HMM)
- Markov chains and state transitions
- HMM components: states, observations, transition matrix, emission matrix
- Three HMM problems:
- Evaluation → Forward algorithm
- Decoding → Viterbi algorithm
- Learning → Baum-Welch (EM algorithm)
- HMM for phoneme modeling
- Practice: Implement HMM for digit recognition
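A compact NumPy sketch of the Viterbi algorithm from the list above, decoding the most likely state path from log-probability matrices; the toy two-state model at the bottom is an illustrative assumption.
import numpy as np

def viterbi(log_pi, log_A, log_B, observations):
    """log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, O) emission log-probs; observations: list of observation ids."""
    S, T = len(log_pi), len(observations)
    delta = np.full((T, S), -np.inf)    # best path log-prob ending in state s at time t
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_A[:, s]
            psi[t, s] = np.argmax(scores)
            delta[t, s] = scores.max() + log_B[s, observations[t]]
    # Backtrack the best state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 2-observation example (illustrative numbers)
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(log_pi, log_A, log_B, [0, 0, 1, 1]))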
- Topic 6: Gaussian Mixture Models (GMM)
- Mixture of Gaussians
- EM algorithm for GMM training
- GMM-HMM acoustic models
- Speaker adaptation: MLLR, MAP adaptation
- Topic 7: N-gram Language Models
- Unigram, bigram, trigram
- Perplexity metric
- Smoothing: Laplace, Kneser-Ney, Good-Turing
- ARPA format language model files
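A toy sketch of the Topic 7 ideas: a bigram model with Laplace (add-one) smoothing and its perplexity on a sentence. The two-sentence corpus is an illustrative assumption; real systems use KenLM-style Kneser-Ney models in ARPA format.
import math
from collections import Counter

corpus = ["<s> the cat sat </s>", "<s> the dog sat </s>"]   # illustrative corpus
tokens = [s.split() for s in corpus]
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
context_counts = Counter(sent[i] for sent in tokens for i in range(len(sent) - 1))
vocab_size = len({w for sent in tokens for w in sent})

def bigram_prob(w1, w2):
    # Laplace (add-one) smoothing over the vocabulary
    return (bigrams[(w1, w2)] + 1) / (context_counts[w1] + vocab_size)

def perplexity(sentence):
    words = sentence.split()
    log_prob = sum(math.log(bigram_prob(words[i], words[i + 1]))
                   for i in range(len(words) - 1))
    return math.exp(-log_prob / (len(words) - 1))

print(perplexity("<s> the cat sat </s>"))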
- Topic 8: Classical Vocoders (TTS)
- Formant synthesis (rule-based)
- Concatenative TTS: unit selection
- STRAIGHT vocoder
- WORLD vocoder (F0 + spectral envelope + aperiodicity)
- Practice: Use WORLD vocoder to analyze and resynthesize speech
PHASE 2: Deep Learning for Speech (6–8 weeks)
- Topic 9: Recurrent Neural Networks
- Vanilla RNN and vanishing gradient problem
- LSTM (Long Short-Term Memory):
- Input gate, forget gate, output gate, cell state
- GRU (Gated Recurrent Unit)
- Bidirectional RNNs
- Practice: Build sequence-to-sequence model for toy TTS
- Topic 10: Convolutional Neural Networks for Audio
- 1D convolution for raw waveform
- 2D convolution for spectrograms
- Dilated causal convolutions (key for WaveNet)
- Depthwise separable convolutions
- Practice: Build CNN-based phoneme classifier
- Topic 11: Attention Mechanisms
- Dot-product attention
- Scaled dot-product attention
- Multi-head attention
- Self-attention vs cross-attention
- Location-sensitive attention (Tacotron)
- Practice: Implement attention from scratch
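A minimal PyTorch sketch of the Practice item above: scaled dot-product attention with an optional mask, single head, batch-first tensors.
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    """query/key/value: (batch, seq_len, d_model); mask broadcastable to (batch, q_len, k_len)."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)    # (batch, q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                    # attention distribution
    return weights @ value, weights

# Toy causal self-attention over a random sequence
x = torch.randn(2, 5, 16)
causal = torch.tril(torch.ones(5, 5))        # lower-triangular mask for autoregressive decoding
out, attn = scaled_dot_product_attention(x, x, x, mask=causal)
print(out.shape, attn.shape)                 # (2, 5, 16) and (2, 5, 5)
Multi-head attention repeats this per head on projected slices of d_model and concatenates the results.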
- Topic 12: Transformer Architecture
- Encoder-Decoder structure
- Positional encoding (sinusoidal, learned, RoPE, ALiBi)
- Feed-forward networks
- Layer normalization (Pre-LN vs Post-LN)
- Masked attention for autoregressive decoding
- Practice: Train a small Transformer on character sequences
- Topic 13: Connectionist Temporal Classification (CTC)
- The alignment problem in speech recognition
- CTC forward algorithm
- CTC loss and gradient
- CTC greedy and beam search decoding
- CTC + language model rescoring
- Practice: Train CTC model on TIMIT dataset
PHASE 3: Modern STT Systems (8–10 weeks)
- Topic 14: End-to-End ASR Architectures
- Listen, Attend and Spell (LAS)
- Deep Speech 1 & 2 (Baidu)
- Jasper, QuartzNet (NVIDIA)
- Conformer (combining CNN + Transformer)
- Architecture comparison: CTC vs Attention vs RNN-T
- Topic 15: Self-Supervised Learning for Speech
- Contrastive Predictive Coding (CPC)
- wav2vec / wav2vec 2.0 (Facebook/Meta)
- CNN feature encoder + Transformer context network
- Quantization module (product quantization)
- Contrastive loss with negative sampling
- HuBERT (Hidden Unit BERT)
- Offline clustering → pseudo-label generation
- BERT-style masked prediction
- WavLM: wav2vec 2.0 + denoising objective
- Practice: Fine-tune wav2vec 2.0 on custom dataset
- Topic 16: Whisper (OpenAI)
- Architecture: Encoder-Decoder Transformer
- Training data: 680,000 hours weakly supervised
- Input: 30-second log-Mel spectrogram (80 channels)
- Multitask training: transcription + translation + language ID + VAD
- Tokenizer: BPE with multilingual vocabulary
- Model sizes: tiny(39M), base(74M), small(244M), medium(769M), large(1.5B)
- Practice: Deploy Whisper, fine-tune on domain-specific data
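A quick sketch of running the reference openai-whisper package on a hypothetical audio file before moving on to fine-tuning, using the library's documented high-level API:
import whisper

model = whisper.load_model("small")                  # tiny/base/small/medium/large
result = model.transcribe("meeting.wav", language="en", task="transcribe")

print(result["text"])                                # full transcript
for segment in result["segments"]:                   # per-segment timestamps
    print(f"[{segment['start']:.2f}-{segment['end']:.2f}] {segment['text']}")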
- Topic 17: RNN-T (Recurrent Neural Network Transducer)
- Encoder (audio) + Prediction network (text) + Joint network
- Transducer loss function
- On-device streaming ASR
- Used by: Google, Apple, Amazon Alexa
- Practice: Train small RNN-T on LibriSpeech subset
- Topic 18: Streaming & Real-Time ASR
- Chunk-based processing
- Latency vs accuracy tradeoff
- Lookahead context
- Cache-aware streaming Conformer
- CTC prefix beam search for streaming
- Practice: Build real-time transcription with WebRTC + Whisper
PHASE 4: Modern TTS Systems (8–10 weeks)
- Topic 19: Neural TTS Pipeline
- Text normalization (written → spoken form)
- Number normalization
- Abbreviation expansion
- Date/time normalization
- G2P (Grapheme-to-Phoneme):
- Rule-based (CMU Pronouncing Dictionary)
- Sequence-to-sequence G2P
- Transformer G2P
- Phoneme inventory and IPA
- Prosody: F0 (pitch), duration, energy
- Topic 20: Tacotron & Tacotron 2
- Tacotron 1: CBHG + attention + Griffin-Lim
- Tacotron 2:
- Encoder: Conv layers + BiLSTM
- Attention: Location-sensitive
- Decoder: Autoregressive LSTM → mel spectrogram
- Stop token prediction
- WaveNet vocoder
- Practice: Train Tacotron 2 on LJ Speech dataset
- Topic 21: FastSpeech & FastSpeech 2
- FastSpeech 1: Knowledge distillation from autoregressive teacher
- Feed-forward Transformer (FFT)
- Length regulator (phoneme duration; sketch below)
- Parallel mel generation (non-autoregressive)
- FastSpeech 2: No teacher-forcing
- Duration predictor
- Pitch predictor (F0)
- Energy predictor
- Variance adaptor
- Speed: ~270x faster mel-spectrogram generation than the autoregressive baseline (per the FastSpeech paper)
- Practice: Train FastSpeech 2 on LJ Speech
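The length regulator referenced above is essentially a repeat-by-duration operation. A minimal PyTorch sketch (the durations are made up for illustration):
import torch

def length_regulator(phoneme_hidden, durations):
    """phoneme_hidden: (num_phonemes, d_model); durations: (num_phonemes,) integer frame counts.
    Expands each phoneme's hidden vector to its predicted number of mel frames."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 256)                # 4 phonemes, 256-dim encoder outputs
durations = torch.tensor([3, 5, 2, 7])      # illustrative per-phoneme frame counts
frames = length_regulator(hidden, durations)
print(frames.shape)                         # torch.Size([17, 256]) -> feeds the mel decoder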
- Topic 22: VITS (Variational Inference TTS)
- End-to-end: text → waveform in one model
- Components: posterior encoder, prior encoder, decoder (HiFi-GAN)
- Variational autoencoder (VAE) latent space
- Normalizing flows (affine coupling layers)
- GAN training for waveform quality
- Stochastic duration predictor
- Practice: Train VITS, experiment with fine-tuning on custom voice
- Topic 23: Neural Vocoders
- WaveNet: Autoregressive dilated causal CNN, slow but high quality
- WaveGlow: Normalizing flow, parallel generation
- MelGAN: GAN-based, fast, lightweight
- HiFi-GAN: Multi-period discriminator + multi-scale discriminator, best quality/speed
- BigVGAN: Large-scale HiFi-GAN with anti-aliased activations
- EnCodec: Neural audio codec (RVQ-based), used as tokenizer
- Practice: Train HiFi-GAN on LJ Speech
- Topic 24: Voice Cloning
- Speaker embeddings: d-vector, x-vector, ECAPA-TDNN (extraction sketch below)
- Speaker verification vs identification
- Zero-shot voice cloning: YourTTS, XTTS, OpenVoice
- Few-shot voice cloning: 3–10 seconds of reference audio
- Speaker encoder: GE2E loss (generalized end-to-end loss)
- Practice: Implement zero-shot voice cloning with XTTS
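Speaker embeddings like those listed above can be extracted with a pretrained ECAPA-TDNN. A sketch using SpeechBrain's published VoxCeleb verification model (the reference file path is hypothetical):
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN speaker encoder trained on VoxCeleb
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("reference_voice.wav")   # hypothetical 16 kHz reference clip
embedding = encoder.encode_batch(signal)              # (1, 1, 192) speaker embedding
print(embedding.shape)

# Cosine similarity between two such embeddings is the usual verification score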
PHASE 5: Large-Scale Models & Advanced Techniques (8–12 weeks)
- Topic 25: Language Models for TTS & STT
- VALL-E: TTS as a language modeling task
- EnCodec tokens (8 RVQ levels)
- AR model for coarse tokens + NAR for fine tokens
- In-context learning for voice cloning
- AudioLM: Audio continuation using hierarchical tokens
- SoundStorm: Non-autoregressive audio generation
- Voicebox: Flow-matching-based TTS
- Topic 26: Diffusion Models for Speech
- Score-based generative models
- Denoising Diffusion Probabilistic Models (DDPM)
- DiffWave: diffusion-based vocoder
- Grad-TTS: diffusion-based acoustic model
- Stable Diffusion concepts applied to audio
- DDIM sampling for fast inference
- Topic 27: Flow Matching
- Continuous normalizing flows
- Flow matching vs diffusion: faster training, ODE-based
- Voicebox (Meta): flow matching for TTS
- Matcha-TTS: ODE-based TTS
- E2-TTS / F5-TTS: flow matching with flat text input
- Topic 28: Multilingual & Code-Switching
- Multilingual acoustic models
- Language identification integration
- Code-switching (mixing languages mid-sentence)
- MMS (Meta Massively Multilingual Speech): 1000+ languages
- Cross-lingual transfer learning
- Low-resource language adaptation
- Topic 29: Emotion & Style Control
- Emotion embeddings (happy, sad, angry, neutral...)
- Global Style Tokens (GST)
- Reference audio-based style transfer
- Prosody transfer
- Voice conversion (change voice, keep content)
- Practice: Build emotion-controlled TTS using GST-Tacotron
PHASE 6: Production & MLOps (4–6 weeks)
- Topic 30: Model Optimization
- Quantization: INT8, INT4, dynamic quantization (sketch after this list)
- Pruning: structured, unstructured, magnitude-based
- Knowledge distillation for smaller models
- ONNX export and ONNX Runtime
- TensorRT optimization (NVIDIA)
- OpenVINO (Intel)
- Edge deployment: TFLite, CoreML, NCNN
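A sketch of two of these steps on a generic PyTorch model: post-training dynamic INT8 quantization and ONNX export. The stand-in model, input shapes, and file names are illustrative assumptions; real TTS/STT models usually need custom export code.
import torch
import torch.nn as nn

# Stand-in model; substitute your trained acoustic model
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 29)).eval()

# Dynamic INT8 quantization of the Linear layers (CPU inference)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# ONNX export of the fp32 model with dynamic batch/time axes
dummy = torch.randn(1, 100, 80)   # (batch, frames, features), illustrative
torch.onnx.export(
    model, dummy, "acoustic_model.onnx",
    input_names=["features"], output_names=["logits"],
    dynamic_axes={"features": {0: "batch", 1: "time"}, "logits": {0: "batch", 1: "time"}},
    opset_version=17,
)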
- Topic 31: Inference Optimization
- Batching strategies (dynamic batching)
- Caching (KV cache, encoder cache)
- Speculative decoding
- CTranslate2 for faster Transformer inference
- Triton Inference Server
- TorchScript and torch.compile
- Topic 32: Service Architecture
- REST API design (FastAPI, Flask)
- WebSocket for real-time streaming
- gRPC for high-performance RPC
- Message queues (RabbitMQ, Kafka) for async processing
- Load balancing and horizontal scaling
- Rate limiting and API key management
- CDN for audio delivery
- Topic 33: MLOps Pipeline
- Experiment tracking: MLflow, Weights & Biases
- Data versioning: DVC
- Model registry and versioning
- CI/CD for ML models
- Monitoring: model drift, latency, error rate
- A/B testing for TTS quality
- Data flywheel and continuous improvement
4. ALGORITHMS, TECHNIQUES & TOOLS
4.1 Core Algorithms
STT Algorithms
| Algorithm | Type | Key Use |
|---|---|---|
| Viterbi | Dynamic Programming | HMM decoding, best path |
| Baum-Welch | EM | HMM training |
| CTC Forward-Backward | DP | CTC loss computation |
| Beam Search | Tree Search | Sequence decoding |
| Prefix Beam Search | Tree Search | CTC with LM integration |
| WFST (Weighted FST) | Graph | Kaldi-style decoding |
| BPE (Byte Pair Encoding) | Tokenization | Subword vocabulary |
| Word2Vec/FastText | Embedding | Text representation |
| Forced Alignment | DP | Aligning audio to transcripts |
TTS Algorithms
| Algorithm | Type | Key Use |
|---|---|---|
| Griffin-Lim | Phase reconstruction | Spectrogram → waveform |
| WORLD vocoder | Signal processing | Parametric voice synthesis |
| VAE | Generative | Latent space for style |
| Normalizing Flows | Generative | Invertible transformations |
| GAN | Generative | Waveform generation, vocoders |
| DDPM | Generative | Diffusion vocoders |
| Flow Matching | Generative | Fast TTS (F5-TTS, Voicebox) |
| RVQ (Residual Vector Quantization) | Compression | Audio tokenization |
4.2 Neural Network Architectures
- CNN: WaveNet, DeepSpeech, Jasper, QuartzNet
- LSTM/GRU: Tacotron, early E2E ASR
- Transformer: Whisper, FastSpeech, wav2vec 2.0
- Conformer: SOTA for ASR (CNN + Self-attention hybrid)
- Diffusion U-Net: DiffWave, Grad-TTS
- Flow network: WaveGlow, Glow-TTS, VITS
- Codec model: EnCodec, SoundStream, DAC
4.3 Training Techniques
- Teacher Forcing: train decoder with ground truth
- Scheduled Sampling: gradually mix teacher/model predictions
- Knowledge Distillation: teacher-student training
- Contrastive Learning: wav2vec, SimCLR-style
- Multi-task Learning: Whisper (transcription + translation + LID)
- Transfer Learning: fine-tune pretrained models
- Data Augmentation:
- SpecAugment (time/frequency masking; sketch below)
- Speed perturbation (0.9x, 1.0x, 1.1x)
- Room Impulse Response (RIR) convolution
- Additive noise (MUSAN, AudioSet)
- Pitch shifting, time stretching
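A short sketch of SpecAugment-style masking using torchaudio's built-in transforms; mask widths are illustrative, and the original paper's time warping step is omitted.
import torch
import torchaudio.transforms as T

spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),   # mask up to 15 consecutive mel bins
    T.TimeMasking(time_mask_param=35),        # mask up to 35 consecutive frames
)

log_mel = torch.randn(1, 80, 400)             # (batch, mels, frames), stand-in features
augmented = spec_augment(log_mel)
print(augmented.shape)                        # same shape, with masked regions zeroed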
4.4 Python Libraries & Frameworks
Audio Processing
librosa – Audio analysis, feature extraction, visualization
soundfile – Read/write audio files (WAV, FLAC, OGG)
pydub – Audio manipulation (cut, join, convert)
scipy.signal – Signal processing primitives
torchaudio – PyTorch audio I/O and transforms
audioread – Backend-agnostic audio reading
pyworld – Python wrapper for WORLD vocoder
resampy – High-quality audio resampling
webrtcvad – Google's WebRTC VAD
silero-vad – Neural VAD (accurate, fast)
Deep Learning
PyTorch – Primary framework for research/production
TensorFlow – Production, mobile (TFLite)
JAX/Flax – Google research framework
HuggingFace Transformers – Pre-trained models hub
HuggingFace Datasets – Dataset loading/processing
STT-Specific
openai-whisper – OpenAI Whisper (all sizes)
faster-whisper – CTranslate2-optimized Whisper (4x faster)
whisperx – Whisper + word-level alignment
nemo (NVIDIA NeMo) – ASR, TTS, NLP toolkit
espnet – End-to-end speech processing
kaldi – Classical + hybrid ASR
speechbrain – PyTorch speech toolkit
wav2letter++ – Meta's ASR toolkit
deepgram – Commercial STT API (also research)
vosk – Offline STT (lightweight)
TTS-Specific
TTS (Coqui) – Open-source TTS: Tacotron, VITS, XTTS
espeak-ng – Lightweight rule-based TTS (G2P)
pyttsx3 – Offline TTS wrapper
bark – Suno's generative TTS (GPT-style)
tortoise-tts – Slow but high-quality multi-voice TTS
XTTS / Coqui XTTS – Multilingual voice cloning (VITS-based)
StyleTTS2 – Style-based TTS (SOTA on LJ Speech)
parler-tts – Description-controlled TTS
kokoro-tts – Lightweight high-quality TTS
Serving & Infrastructure
FastAPI – Async Python web framework
uvicorn – ASGI server
triton – NVIDIA model serving
onnxruntime – Cross-platform model inference
ctranslate2 – Efficient Transformer inference
ray serve – Distributed model serving
celery – Async task queue
redis – Caching, pub/sub, queue
5. ARCHITECTURE DEEP DIVE
5.1 STT Architecture Family Tree
Classical Era
├── GMM-HMM (1990s–2010s)
│   ├── Feature: MFCC
│   ├── Acoustic: GMM per HMM state
│   └── Decoder: Viterbi + N-gram LM
│
└── DNN-HMM (2012–2016)
    ├── Feature: MFCC / fbank
    ├── Acoustic: DNN replaces GMM
    └── Decoder: Viterbi + WFST

End-to-End Era
├── CTC-Based (2014–2019)
│   ├── DeepSpeech 1 & 2: RNN + CTC
│   ├── Jasper: CNN + CTC
│   └── QuartzNet: Depthwise sep CNN + CTC
│
├── Attention-Based (2016–2020)
│   ├── LAS: LSTM encoder + attention decoder
│   └── Transformer ASR: Self-attention encoder + decoder
│
└── Hybrid CTC-Attention (2017–present)
    └── ESPnet models, Conformer

Self-Supervised Era
├── wav2vec 2.0 (2020): CNN + Transformer + contrastive
├── HuBERT (2021): CNN + Transformer + BERT-style
├── WavLM (2022): HuBERT + denoising
└── Whisper (2022): Supervised multitask, Enc-Dec Transformer

Streaming / On-device
├── RNN-T: encoder + predictor + joiner
├── Streaming Conformer: chunk-based
└── Distil-Whisper: 6x faster distilled version
5.2 Conformer Architecture (SOTA for ASR)
Input Audio → Log-Mel Spectrogram (80 dims) → Conv Subsampling (4x)
→ Linear Projection → [Conformer Block × N] → CTC / Attention Head
Conformer Block:
Input
↓
Feed-Forward Module (½ scaling)
↓
Multi-Head Self-Attention Module
↓
Convolution Module (depthwise)
↓
Feed-Forward Module (½ scaling)
↓
LayerNorm
↓
Output
Convolution Module:
LayerNorm → Pointwise Conv → GLU → Depthwise Conv →
BatchNorm → Swish activation → Pointwise Conv → Dropout
5.3 Whisper Architecture Detail
Encoder:
Log-Mel Spectrogram (80 × 3000 frames for 30s)
→ 2× Conv1D (stride 1, 2) + GELU
→ Sinusoidal Positional Encoding
→ Transformer Encoder Blocks (4–32 layers depending on model size)
Each block: Self-Attention + FFN + LayerNorm (pre-norm)
Decoder:
Special tokens: <|startoftranscript|> <|language|> <|task|> <|notimestamps|>
→ Token Embedding + Learned Positional Encoding
→ Transformer Decoder Blocks (4–32 layers)
Each block: Masked Self-Attention + Cross-Attention + FFN
→ Linear → Softmax over vocab (51865 tokens)
5.4 TTS Architecture Family Tree
Classical Era
├── Formant synthesis (rule-based, 1960s–)
├── Concatenative TTS (unit selection, 1990s–)
│   └── Record many hours → select and concatenate units
└── HMM-based TTS (HTS, 2000s–)
    └── STRAIGHT/WORLD vocoder
Neural Era
├── Seq2Seq + Attention
│   ├── Tacotron 1 (2017): CBHG + Griffin-Lim
│   └── Tacotron 2 (2017): BiLSTM + WaveNet vocoder
│
├── Parallel / Non-autoregressive
│   ├── FastSpeech 1 (2019): FFT + duration from teacher
│   ├── FastSpeech 2 (2020): Duration/pitch/energy predictors
│   ├── SpeedySpeech (2020)
│   └── JETS (2022): E2E with alignment learning
│
├── Normalizing Flow Based
│   ├── Glow-TTS (2020): Flow-based alignment + generation
│   └── VITS (2021): E2E VAE + flows + HiFi-GAN
│
└── Diffusion Based
    ├── DiffTTS (2021)
    ├── Grad-TTS (2021): Score-based diffusion
    └── NaturalSpeech (2022): VITS + diffusion

LLM/Codec Era (2023–present)
├── VALL-E (2023): AR + NAR codec language model
├── SPEAR-TTS (2023): Self-supervised TTS
├── Voicebox (2023): Flow matching
├── NaturalSpeech 3 (2024): FACodec + diffusion
├── F5-TTS (2024): Flow matching + flat text
└── CosyVoice (2024): LLM + flow matching
5.5 VITS Architecture Detail (Recommended Starting Point)
TEXT INPUT
↓
[Text Encoder]
Phoneme embedding → Transformer encoder → Prior distribution μ, σ
[Stochastic Duration Predictor]
Flow-based duration prediction
[Length Regulator]
Expand phoneme representations to frame length
[Decoder / Flow-based Posterior]
VAE encoder: mel → latent z
Normalizing flows: transforms z
[HiFi-GAN Generator] (Vocoder)
z → raw waveform
[Discriminators] (training only)
Multi-Period Discriminator (MPD)
Multi-Scale Discriminator (MSD)
LOSS = Mel loss + KL divergence + Duration loss + GAN loss + Feature matching loss
5.6 HiFi-GAN Architecture Detail
Generator:
Input: Mel Spectrogram (80 × T)
→ Transposed Conv (×4 upsample) → MRF Block → repeat until audio rate
MRF Block = Multi-Receptive Field Fusion
= ResBlock(k=3) + ResBlock(k=7) + ResBlock(k=11)
Each ResBlock: dilated conv with rates [1,3,5]
Output: Raw waveform at 22050Hz
Multi-Period Discriminator (MPD):
Periods p = [2, 3, 5, 7, 11]
Reshape waveform into (T/p, p) → Conv2D per period
Multi-Scale Discriminator (MSD):
Operate at 3 scales: raw, ×2 avg pooled, ×4 avg pooled
6. DESIGN & DEVELOPMENT PROCESS
6.1 STT Development from Scratch
Step 1: Data Collection & Preparation
Sources:
- LibriSpeech: 960h clean English (openslr.org)
- CommonVoice: Mozilla multilingual crowdsourced
- VoxPopuli: EU parliament recordings
- FLEURS: Google multilingual
- Custom: Record, transcribe, verify
Pipeline:
raw_audio → segment_by_vad → normalize_loudness →
resample_to_16kHz → verify_transcript → create_manifest_json
Manifest format:
{"audio_filepath": "path/to/audio.wav", "duration": 3.2, "text": "hello world"}
Step 2: Feature Extraction
import torch
import torchaudio
import torchaudio.transforms as T
def extract_mel_spectrogram(waveform, sample_rate=16000):
mel_transform = T.MelSpectrogram(
sample_rate=sample_rate,
n_fft=400, # ~25ms window at 16kHz
hop_length=160, # ~10ms hop
n_mels=80,
f_min=80,
f_max=7600
)
log_mel = torch.log(mel_transform(waveform) + 1e-9)
return log_mel # Shape: (80, T)
Step 3: Model Architecture (Conformer CTC)
import torch.nn as nn

# Conv2dSubsampling and ConformerBlock are assumed to be implemented separately (see Topics 12 and 14)
class ConformerASR(nn.Module):
def __init__(self, input_dim=80, vocab_size=29, d_model=256, num_heads=4, num_layers=6):
super().__init__()
self.conv_subsample = Conv2dSubsampling(input_dim, d_model)
self.encoder = nn.ModuleList([
ConformerBlock(d_model, num_heads) for _ in range(num_layers)
])
self.ctc_head = nn.Linear(d_model, vocab_size)
def forward(self, x, x_lengths):
x, x_lengths = self.conv_subsample(x, x_lengths)
for block in self.encoder:
x = block(x)
logits = self.ctc_head(x)
return logits, x_lengths
Step 4: Training Loop
import torch
import torch.nn.functional as F
from torch.nn import CTCLoss
criterion = CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for batch in dataloader:
audio, audio_len, tokens, token_len = batch
logits, out_len = model(audio, audio_len)
log_probs = F.log_softmax(logits.transpose(0,1), dim=-1)
loss = criterion(log_probs, tokens, out_len, token_len)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
Step 5: Decoding
# Greedy CTC decoding
def greedy_decode(logits, blank_id=0):
predicted = torch.argmax(logits, dim=-1)
decoded = []
prev = blank_id
for p in predicted:
if p != blank_id and p != prev:
decoded.append(p.item())
prev = p
return decoded
# Beam search with language model (use pyctcdecode)
from pyctcdecode import build_ctcdecoder
decoder = build_ctcdecoder(vocab, kenlm_model_path="lm.arpa", alpha=0.5, beta=1.0)
text = decoder.decode(logits.numpy())
Step 6: Evaluation
from jiwer import wer, cer
# Word Error Rate
error_rate = wer(reference_texts, hypothesis_texts)
char_error = cer(reference_texts, hypothesis_texts)
print(f"WER: {error_rate:.2%}, CER: {char_error:.2%}")
6.2 TTS Development from Scratch
Step 1: Data Collection & Preparation
Datasets:
- LJ Speech: 24h single speaker English (ljspeech.github.io)
- VCTK: 109 English speakers
- LibriTTS: 585h multi-speaker
- HiFi-TTS: High quality multi-speaker
- Custom: Studio-quality recording (quiet room, good mic)
Recording specs for custom:
- 44.1kHz or 48kHz, 24-bit, mono
- Acoustic treatment (no echo/reverb)
- Consistent mic distance (15–20cm)
- Phonetically balanced scripts
- 1–10 hours for fine-tuning; 20+ for training from scratch
Preprocessing:
audio → normalize_to_-20dBFS → trim_silence → resample_22050Hz →
extract_mel → create_filelists (train|val|test)
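A sketch of that preprocessing chain for one file, using librosa and soundfile; the target level, trim threshold, and file paths are illustrative assumptions.
import numpy as np
import librosa
import soundfile as sf

def preprocess(in_path, out_path, target_sr=22050, target_dbfs=-20.0, top_db=40):
    y, sr = librosa.load(in_path, sr=None, mono=True)
    # Trim leading/trailing silence
    y, _ = librosa.effects.trim(y, top_db=top_db)
    # Resample to the TTS training rate
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    # RMS-normalize towards the target dBFS level
    rms = np.sqrt(np.mean(y ** 2) + 1e-12)
    gain = 10 ** (target_dbfs / 20) / rms
    y = np.clip(y * gain, -1.0, 1.0)
    sf.write(out_path, y, target_sr)

preprocess("raw/clip_0001.wav", "processed/clip_0001.wav")   # illustrative paths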
Step 2: Text Frontend
import phonemizer
from phonemizer.backend import EspeakBackend
backend = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True)
def text_to_phonemes(text):
    # Normalize text first (normalize_numbers / expand_abbreviations are assumed helper functions)
    text = normalize_numbers(text)       # "123" -> "one hundred twenty three"
    text = expand_abbreviations(text)    # "Dr." -> "Doctor"
# Convert to phonemes
phonemes = backend.phonemize([text])[0]
return phonemes
# Phoneme to ID mapping
phoneme_to_id = {p: i for i, p in enumerate(PHONEME_SET)}
Step 3: FastSpeech 2 Model
import torch.nn as nn

# FFTTransformer and VarianceAdaptor are assumed to be implemented separately
class FastSpeech2(nn.Module):
def __init__(self):
super().__init__()
self.encoder = FFTTransformer(n_layers=4, d_model=256, n_heads=2)
self.variance_adaptor = VarianceAdaptor(d_model=256)
self.decoder = FFTTransformer(n_layers=4, d_model=256, n_heads=2)
self.mel_linear = nn.Linear(256, 80)
def forward(self, phoneme_ids, duration_target=None, pitch_target=None, energy_target=None):
x = self.encoder(phoneme_ids)
x, duration, pitch, energy = self.variance_adaptor(
x, duration_target, pitch_target, energy_target
)
x = self.decoder(x)
mel = self.mel_linear(x)
return mel, duration, pitch, energy
Step 4: HiFi-GAN Vocoder Training
# Train HiFi-GAN on mel → waveform
# Generator loss
mel_loss = F.l1_loss(mel_fake, mel_real)
gan_loss = generator_adversarial_loss(disc_fake_outputs)
feature_match = feature_matching_loss(disc_real_features, disc_fake_features)
loss_G = mel_loss * 45 + gan_loss + feature_match * 2
# Discriminator loss
loss_D = discriminator_loss(disc_real_outputs, disc_fake_outputs)
Step 5: End-to-End Inference Pipeline
class TTSPipeline:
def __init__(self, tts_model, vocoder):
self.tts = tts_model
self.vocoder = vocoder
def synthesize(self, text, speed=1.0):
# 1. Text β phonemes
phonemes = text_to_phonemes(text)
phoneme_ids = text_to_ids(phonemes)
# 2. TTS model β mel spectrogram
with torch.no_grad():
mel, *_ = self.tts(
torch.LongTensor(phoneme_ids).unsqueeze(0),
d_control=speed
)
# 3. Vocoder β waveform
with torch.no_grad():
audio = self.vocoder(mel)
return audio.squeeze().cpu().numpy()
7. REVERSE ENGINEERING EXISTING SYSTEMS
7.1 Why Reverse Engineering?
- Learn from production-grade code
- Understand design decisions
- Identify optimizations for your use case
- Build intuition faster than pure theory
7.2 How to Reverse Engineer Whisper
Step 1: Read the Paper
- "Robust Speech Recognition via Large-Scale Weak Supervision" (Radford et al. 2022)
- Note: architecture section, training details, data section
Step 2: Clone and Explore Code
git clone https://github.com/openai/whisper
# Key files:
# whisper/model.py – Architecture
# whisper/audio.py – Feature extraction
# whisper/decoding.py – Beam search decoder
# whisper/tokenizer.py – BPE tokenizer
Step 3: Trace Forward Pass
import torch
import whisper
model = whisper.load_model("tiny")
# Trace: audio -> features (pad/trim to 30 s so the mel has 3000 frames)
audio = whisper.load_audio("speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio)  # (80, 3000)
# Encoder
encoded = model.encoder(mel.unsqueeze(0).to(model.device))  # (1, 1500, 384) for tiny
# Decoder (autoregressive greedy loop)
tokenizer = whisper.tokenizer.get_tokenizer(model.is_multilingual)
tokens = [tokenizer.sot]  # start-of-transcript token
for _ in range(100):
    logits = model.decoder(torch.tensor([tokens]).to(model.device), encoded)
    next_token = logits[0, -1].argmax().item()
    if next_token == tokenizer.eot:
        break
    tokens.append(next_token)
Step 4: Profile Bottlenecks
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
result = model.transcribe("audio.wav")
print(prof.key_averages().table(sort_by="cuda_time_total"))
# Identify: encoder dominates (80%), decoder is 20%
# Optimization: cache encoder, optimize decoder attention
Step 5: Rebuild from Scratch (Your Understanding)
# After studying, rebuild each component:
class MultiHeadAttention(nn.Module):
# Implement from scratch based on understanding
...
class ResidualAttentionBlock(nn.Module):
# Implement encoder block
...
# Compare outputs to original model
torch.testing.assert_close(your_output, original_output, atol=1e-4, rtol=1e-4)
7.3 How to Reverse Engineer VITS
git clone https://github.com/jaywalnut310/vits
# Key files:
# models.py – SynthesizerTrn (full model)
# attentions.py – Transformer blocks
# modules.py – WN (WaveNet-style), ResidualCouplingBlock (flows)
# monotonic_align/ – MAS (Monotonic Alignment Search)
# mel_processing.py – Mel spectrogram computation
Key insight from VITS code:
- SynthesizerTrn.infer() is inference path (no VAE encoder needed)
- SynthesizerTrn.forward() is training path (requires mel as target)
- monotonic_align.maximum_path() is the alignment algorithm (Cython)
7.4 Reverse Engineering Approach Template
- READ paper abstract + architecture section → mental model
- CLONE repository β understand file structure
- TRACE data flow (print shapes at each step; see the hook sketch below)
- IDENTIFY key components → isolate each into test
- REPRODUCE in clean code from memory
- VERIFY outputs match original
- EXPERIMENT: change hyperparameters, observe effects
- OPTIMIZE: profile, identify bottlenecks, improve
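One way to do the TRACE step in PyTorch: register forward hooks that print each leaf module's output shape while running a single example. The model and input below are stand-ins; swap in whatever you are reverse engineering.
import torch
import torch.nn as nn

def trace_shapes(model):
    """Attach a hook to every leaf module that prints its output shape."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            print(f"{module.__class__.__name__:20s} -> {tuple(output.shape)}")
    handles = []
    for module in model.modules():
        if len(list(module.children())) == 0:   # leaf modules only
            handles.append(module.register_forward_hook(hook))
    return handles                              # call h.remove() on each handle when done

# Stand-in model and input
model = nn.Sequential(nn.Conv1d(80, 256, 3, padding=1), nn.ReLU(), nn.Conv1d(256, 256, 3, padding=1))
handles = trace_shapes(model)
_ = model(torch.randn(1, 80, 300))
for h in handles:
    h.remove()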
8. HARDWARE REQUIREMENTS
8.1 Development Hardware
Minimum (Learning & Experimentation)
CPU: Intel Core i7 / AMD Ryzen 7 (8+ cores)
RAM: 16GB (32GB preferred)
GPU: NVIDIA RTX 3060 (12GB VRAM) or RTX 3070
Storage: 500GB SSD (NVMe preferred)
Note: Can fine-tune small models, run inference on all models
Recommended (Training Medium Models)
CPU: Intel Core i9 / AMD Ryzen 9 / Threadripper
RAM: 64GB DDR4/DDR5
GPU: NVIDIA RTX 3090 (24GB) or RTX 4090 (24GB) β single GPU
Storage: 2TB NVMe SSD + 4TB HDD for datasets
Cost: ~$3,000–$5,000
Note: Train FastSpeech2, HiFi-GAN, Conformer from scratch on LJ Speech
Research-Grade (Training Large Models)
GPU: 4× RTX 4090 or 4× A100 (40GB or 80GB)
RAM: 256GB
Storage: 10TB+ NVMe
Network: 100GbE for distributed training
Cost: $15,000–$40,000
Note: Train VITS, Conformer on LibriSpeech 960h
Cloud (Production Training)
AWS:
p3.2xlarge – 1× V100 16GB ($3.06/hr)
p3.8xlarge – 4× V100 64GB ($12.24/hr)
p4d.24xlarge – 8× A100 40GB ($32.77/hr)
Google Cloud:
a2-highgpu-1g – 1× A100 40GB ($3.67/hr)
a2-highgpu-8g – 8× A100 40GB ($29.39/hr)
Lambda Labs (cheapest GPU cloud):
1× A100 80GB ~$1.29/hr
8× A100 80GB ~$10.32/hr
Use Spot/Preemptible instances for ~60–70% discount
8.2 VRAM Requirements by Model
| Model | Task | VRAM (Training) | VRAM (Inference) |
|---|---|---|---|
| Conformer-S (10M) | ASR | 8GB | <1GB |
| Conformer-M (30M) | ASR | 12GB | 2GB |
| Whisper Large v3 | ASR | – (pretrained) | 6GB |
| FastSpeech 2 | TTS | 8GB | <1GB |
| VITS | TTS | 12GB | 2GB |
| HiFi-GAN | Vocoder | 8GB | <1GB |
| VALL-E style | TTS | 40GB+ | 8GB+ |
| Whisper large fine-tune | ASR | 24GB | 6GB |
8.3 Production Inference Hardware
CPU-Only (Lightweight)
Use case: Low-traffic, edge, embedded
Hardware: Modern x86 CPU with AVX2
Tools: ONNX Runtime, OpenVINO, CTranslate2 CPU
Models: Whisper tiny/base, Kokoro TTS
Latency: 1–5x real-time (RTF > 1)
GPU Server (Production)
Use case: High-traffic API service
Hardware: NVIDIA T4 ($0.35/hr on AWS), A10G, RTX 4090
Tools: Triton Server, TensorRT, CTranslate2 GPU
Models: Whisper large, VITS, XTTS
Latency: 0.1–0.3x real-time (RTF 0.1–0.3)
Edge Devices
NVIDIA Jetson Orin: On-device AI, 16-64GB unified memory
Apple Silicon M2/M3: Metal GPU, excellent for CoreML models
Raspberry Pi 5: Light STT only (Vosk, Whisper tiny)
Android/iOS: TFLite, ONNX Mobile models
9. BUILDING YOUR OWN SERVICE
9.1 System Architecture Overview
┌──────────────────────────────────────┐
│             CLIENT LAYER             │
│    Web App | Mobile | API Consumer   │
└──────────────────┬───────────────────┘
                   │ HTTPS / WebSocket
┌──────────────────┴───────────────────┐
│             API GATEWAY              │
│ Rate Limiting | Auth | Load Balance  │
└───────┬───────────────────┬──────────┘
        │                   │
┌───────┴────────────┐ ┌────┴─────────────────────┐
│    STT Service     │ │       TTS Service        │
│ FastAPI + Whisper  │ │   FastAPI + VITS/XTTS    │
└───────┬────────────┘ └────┬─────────────────────┘
        │                   │
┌───────┴───────────────────┴──────────────────────┐
│                 INFERENCE LAYER                   │
│     GPU Workers (Triton / CTranslate2 / ONNX)     │
└───────┬───────────────────┬───────────────────────┘
        │                   │
┌───────┴─────────┐ ┌───────┴────────────────┐
│  Message Queue  │ │     Model Registry     │
│  (Redis/Kafka)  │ │     (MLflow / S3)      │
└─────────────────┘ └────────────────────────┘
┌───────────────────────────────────────────────────┐
│                  STORAGE LAYER                    │
│     Audio Storage (S3/GCS) | DB (PostgreSQL)      │
└───────────────────────────────────────────────────┘
9.2 STT Service Implementation
FastAPI STT Service
from fastapi import FastAPI, UploadFile, File, WebSocket
from fastapi.responses import JSONResponse
import io
import numpy as np
from faster_whisper import WhisperModel
app = FastAPI(title="STT Service")
# Load model at startup
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
@app.post("/transcribe")
async def transcribe(
file: UploadFile = File(...),
language: str = "en",
task: str = "transcribe" # or "translate"
):
# Read uploaded audio
audio_bytes = await file.read()
audio_buffer = io.BytesIO(audio_bytes)
# Transcribe
segments, info = model.transcribe(
audio_buffer,
language=language,
task=task,
beam_size=5,
word_timestamps=True
)
result = {
"language": info.language,
"language_probability": info.language_probability,
"duration": info.duration,
"segments": [
{
"start": s.start,
"end": s.end,
"text": s.text,
"words": [{"word": w.word, "start": w.start, "end": w.end}
for w in (s.words or [])]
}
for s in segments
]
}
return JSONResponse(result)
@app.websocket("/stream")
async def stream_transcribe(websocket: WebSocket):
await websocket.accept()
# Streaming implementation
buffer = b""
async for data in websocket.iter_bytes():
buffer += data
if len(buffer) >= 32000 * 2: # 2 seconds of 16kHz int16
# Process chunk
audio = np.frombuffer(buffer, dtype=np.int16).astype(np.float32) / 32768.0
segments, _ = model.transcribe(audio, language="en")
for seg in segments:
await websocket.send_json({"text": seg.text, "final": False})
buffer = b""
Dockerized STT Service
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip ffmpeg
WORKDIR /app
RUN pip3 install --no-cache-dir faster-whisper fastapi uvicorn python-multipart
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
9.3 TTS Service Implementation
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from TTS.api import TTS
import io
import soundfile as sf
import numpy as np
app = FastAPI(title="TTS Service")
# Load model
tts = TTS("tts_models/en/ljspeech/vits", gpu=True)
@app.post("/synthesize")
async def synthesize(
text: str,
speaker_id: int = 0,
speed: float = 1.0,
format: str = "wav"
):
# Generate audio
    # Single-speaker model: no speaker argument needed (multi-speaker models take speaker=<name>)
    wav = tts.tts(text=text, speed=speed)
# Convert to bytes
buffer = io.BytesIO()
sf.write(buffer, np.array(wav), 22050, format=format.upper())
buffer.seek(0)
return StreamingResponse(
buffer,
media_type=f"audio/{format}",
headers={"Content-Disposition": f"attachment; filename=speech.{format}"}
)
@app.post("/synthesize/stream")
async def synthesize_stream(text: str):
"""Stream audio chunks as they're generated"""
    async def generate():
        # split_into_sentences and wav_to_bytes are assumed helper functions
        for sentence in split_into_sentences(text):
            wav = tts.tts(text=sentence)
            yield wav_to_bytes(wav)
return StreamingResponse(generate(), media_type="audio/wav")
9.4 Voice Cloning Service
# Using XTTS for zero-shot voice cloning
# (reuses app, io, sf, StreamingResponse, UploadFile and File from the services above)
import uuid
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
config = XttsConfig()
config.load_json("XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/", eval=True)
model.cuda()
@app.post("/clone")
async def clone_voice(
text: str,
reference_audio: UploadFile = File(...),
language: str = "en"
):
# Save reference audio temporarily
ref_bytes = await reference_audio.read()
ref_path = f"/tmp/{uuid.uuid4()}.wav"
with open(ref_path, "wb") as f:
f.write(ref_bytes)
# Compute speaker latents
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
audio_path=[ref_path]
)
# Synthesize
out = model.inference(
text=text,
language=language,
gpt_cond_latent=gpt_cond_latent,
speaker_embedding=speaker_embedding,
temperature=0.7
)
buffer = io.BytesIO()
sf.write(buffer, out["wav"], 24000, format="WAV")
buffer.seek(0)
return StreamingResponse(buffer, media_type="audio/wav")
9.5 Deployment with Docker Compose
version: '3.8'
services:
stt-service:
build: ./stt
ports: ["8001:8000"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./models:/models
environment:
- MODEL_PATH=/models/whisper-large-v3
tts-service:
build: ./tts
ports: ["8002:8000"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./models:/models
nginx:
image: nginx:alpine
ports: ["80:80", "443:443"]
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
- ./certs:/etc/ssl/certs
depends_on: [stt-service, tts-service]
redis:
image: redis:7-alpine
ports: ["6379:6379"]
prometheus:
image: prom/prometheus
ports: ["9090:9090"]
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
10. BUILD IDEAS: BEGINNER β ADVANCED
BEGINNER LEVEL (0–3 months)
Project 1: Voice Recorder + Transcriber
BEGINNER
Stack: Python + OpenAI Whisper + sounddevice
- Record from microphone with sounddevice
- Save as WAV file
- Pass to Whisper for transcription
- Save transcript as text file
Learning: Audio I/O, Whisper API, file handling
Project 2: Text-to-Speech Converter
BEGINNER
Stack: Python + Coqui TTS or pyttsx3
- Accept text input via CLI
- Generate speech audio
- Play back or save to file
- Support multiple voices
Learning: TTS API, audio output, voice selection
Project 3: Meeting Transcriber
BEGINNER
Stack: Python + Whisper + pyaudio + tkinter
- Simple GUI with start/stop recording
- Real-time transcription display
- Export to .txt or .docx
- Speaker diarization (basic)
Learning: GUI, audio streaming, file export
Project 4: Language Learning Pronunciation Checker
BEGINNER
Stack: Python + Whisper + phonemizer
- User reads a sentence aloud
- Compare recognized phonemes to expected phonemes
- Score pronunciation accuracy
- Highlight mispronounced words
Learning: Phoneme comparison, scoring, feedback
INTERMEDIATE LEVEL (3–8 months)
Project 5: Real-Time Transcription Web App
INTERMEDIATE
Stack: FastAPI + WebSocket + Whisper + React
- Browser captures microphone stream (MediaRecorder API)
- Streams audio chunks to WebSocket server
- Server transcribes chunks with streaming Whisper
- Live captions displayed in browser
- Export transcript feature
Learning: WebSockets, browser audio API, streaming inference
Project 6: Podcast TTS Generator
INTERMEDIATE
Stack: FastSpeech2 + HiFi-GAN + FastAPI + React
- Input: article URL or text
- Extract text from URL (newspaper3k)
- Normalize text (numbers, abbreviations)
- Generate speech with FastSpeech2 + HiFi-GAN
- Return downloadable MP3
- Add playback controls in UI
Learning: TTS pipeline, text normalization, web scraping, audio encoding
Project 7: Voice Command System
INTERMEDIATE
Stack: Whisper + intent classification + TTS response
- Wake word detection (Picovoice Porcupine or custom)
- STT for command capture
- Intent extraction (fine-tuned BERT or regex)
- Execute command (volume, calendar, search, etc.)
- TTS response
Learning: Wake word detection, intent classification, action execution
Project 8: Multi-Speaker Diarization + Transcription
INTERMEDIATE
Stack: Whisper + pyannote.audio + spaCy
- Transcribe audio with word timestamps
- Run speaker diarization (pyannote)
- Assign speakers to transcribed words
- Output: "Speaker 1: Hello... Speaker 2: Hi..."
- Format as readable transcript
Learning: Diarization, timestamp alignment, NLP post-processing
Project 9: Fine-Tuned Domain ASR
INTERMEDIATE
Stack: Whisper + HuggingFace + custom medical/legal corpus
- Collect domain-specific audio+transcripts
- Fine-tune Whisper small or medium
- Evaluate domain-specific WER improvement
- Deploy via FastAPI
Learning: Transfer learning, dataset preparation, evaluation, deployment
Project 10: Custom Voice Cloner
INTERMEDIATE
Stack: XTTS-v2 or YourTTS + FastAPI
- API endpoint: /clone with {text, reference_audio}
- Accept 5β15s reference audio
- Generate speech in cloned voice
- Multiple language support
Learning: Voice cloning, speaker embeddings, API design
ADVANCED LEVEL (8–18 months)
Project 11: Production STT Service (Commercial Grade)
ADVANCED
Features:
- Multi-language detection and transcription
- Real-time streaming (WebSocket) + batch API (REST)
- Speaker diarization
- Custom vocabulary / hotwords boosting
- Punctuation and capitalization restoration
- Confidence scores per word
- Webhook callbacks for async jobs
- Dashboard: usage, latency, error rates
Stack: Whisper large + NeMo + pyannote + FastAPI + Redis + PostgreSQL +
Prometheus + Grafana + Kubernetes + Nginx
Scaling: Horizontal pod autoscaling based on GPU queue depth
Project 12: Production TTS Service (API like ElevenLabs)
ADVANCED
Features:
- 20+ pre-built voices with distinct personalities
- Zero-shot voice cloning from <30s reference
- Emotion/style control (happy, sad, excited, whisper)
- SSML support (rate, pitch, emphasis, break)
- Streaming audio generation
- 20+ language support
- REST API + Python/JS SDK
- Usage billing integration
Stack: VITS + XTTS + HiFi-GAN + FastAPI + Stripe + Redis + S3 + CDN
Project 13: Voice Conversion System
ADVANCED
Features:
- Convert speaker identity while preserving content
- Any-to-any voice conversion
- Real-time capability (<200ms latency)
Architecture:
Input audio → ASR (content) + Speaker encoder (style)
→ Voice decoder → Target voice audio
Models: FreeVC, DDSP-VC, Diff-VC, QuickVC
Learning: Disentanglement, speaker representation, real-time processing
Project 14: End-to-End Speech Translation
ADVANCED
Architecture:
Source language audio → SeamlessM4T / NLLB-Audio
→ Target language text (or audio)
Features:
- Direct speech-to-speech translation (no text intermediate)
- 100+ language pairs
- Real-time streaming
- Preserve prosody/emotion in output
Stack: SeamlessM4T (Meta) + FastAPI + WebSocket
Project 15: Train Your Own TTS from Scratch
ADVANCED
Steps:
1. Record 20+ hours of custom voice in studio
2. Segment and transcribe all audio (Whisper-assisted)
3. Train FastSpeech2 acoustic model from scratch
4. Train HiFi-GAN vocoder from scratch
5. Fine-tune VITS end-to-end
6. Implement MOS (Mean Opinion Score) evaluation
7. A/B test against Coqui/ElevenLabs
Learning: Full training pipeline, data curation, model evaluation, production deployment
RESEARCH / EXPERT LEVEL (18+ months)
Project 16: Codec Language Model TTS (VALL-E style)
RESEARCH
Architecture:
Text → Phonemes → Token sequence → AR Transformer → Coarse codec tokens
→ NAR Transformer → Fine codec tokens → EnCodec decoder → Audio
Training:
- Pretrain on 10,000+ hours of diverse speech
- EnCodec tokenizer (8 codebooks, 75Hz)
- GPT-style LM for coarse tokens
- BERT-style masked model for fine tokens
Innovation opportunities:
- Better alignment between text and audio tokens
- Emotion conditioning
- Efficiency improvements
Project 17: Streaming On-Device STT (Mobile)
RESEARCH
Target: <50ms latency, <100MB model, runs on phone CPU
Approach:
- Start with Conformer-Tiny + CTC
- Quantize to INT8
- Optimize with TFLite delegate or CoreML
- Implement streaming chunk processing
- Add on-device LM rescoring (tiny n-gram)
Platforms: Android (TFLite) + iOS (CoreML)
Learning: Mobile ML optimization, quantization, edge deployment
Project 18: Multilingual Universal Speech Model
RESEARCH
Scope: Single model for 50+ languages, STT + TTS
STT:
- Pretrain wav2vec 2.0 on 50-language corpus
- Fine-tune with multilingual CTC
- Adapter modules per language
TTS:
- Shared phoneme inventory across languages
- Language embedding conditioning
- Cross-lingual transfer for low-resource languages
Evaluation: FLEURS benchmark across all languages
11. CUTTING-EDGE DEVELOPMENTS (2023–2025)
11.1 Speech Recognition
- Whisper Large v3 Turbo (2024): 809M params, decoder pruned from 32 to 4 layers, several times faster than large-v3 with similar accuracy
- Distil-Whisper (2023, Hugging Face): 6x speedup, 49% fewer params, <1% WER degradation
- Universal-1 (AssemblyAI, 2024): SOTA commercial STT, best on noisy data
- Gemini Audio (Google, 2024): Natively multimodal, audio reasoning
- Canary-1B (NVIDIA, 2024): Conformer + attention, multilingual, speech translation
- MMS (Meta, 2023): 1000+ language STT using one model
- OWSM (2024): Open-source replica of Whisper at larger scale (25k hours+)
- parakeet-tdt (NVIDIA, 2024): Token-and-duration transducer, near real-time
11.2 Text-to-Speech
- VALL-E 2 (Microsoft, 2024): reports human parity on LibriSpeech/VCTK benchmarks, using repetition-aware sampling and grouped code modeling
- NaturalSpeech 3 (Microsoft, 2024): FACodec disentanglement + diffusion
- CosyVoice (Alibaba, 2024): LLM-based TTS with flow matching
- F5-TTS (2024): Flow matching TTS, DiT architecture, flat text input, SOTA
- E2-TTS (2024): Simple flow-matching TTS, impressive quality
- FireRedTTS (2024): High-quality Chinese TTS
- Kokoro (2024): Small (82M), fast, open-weights, near-SOTA quality
- StyleTTS 2 (2023): Diffusion + style modeling, SOTA on LJ Speech
- Parler-TTS (2024): Natural language description controls TTS voice
- Amphion (2024): Unified open-source TTS/VC/SVC framework
- HierSpeech++ (2024): Hierarchical variational inference, high quality
11.3 Voice & Audio Foundation Models
- EnCodec (Meta, 2022): Neural audio codec, 24kHz, residual VQ
- DAC (Descript, 2023): Improved neural codec, better perceptual quality
- AudioPaLM (Google, 2023): Multimodal LLM combining speech + text
- SpeechX (Microsoft, 2023): Unified speech model for many tasks
- UniAudio (2023): One model for 11 audio tasks
- VoxtLM (2024): Language model for joint speech-text
- Spirit LM (Meta, 2024): Interleaved speech-text LLM with expressive speech
11.4 Voice Cloning & Conversion
- OpenVoice v2 (2024): Near-zero-shot cloning, tone/style/accent control
- XTTS v2 (Coqui, 2023): 17-language voice cloning, 6s reference
- RVC v2: Real-Time Voice Conversion, widely used for singing conversion
- So-VITS-SVC: Singing voice conversion based on VITS
- Seed-TTS (ByteDance, 2024): Near-perfect voice cloning, emotional control
11.5 Real-Time & Streaming
- Moshi (Kyutai, 2024): Real-time full-duplex speech dialogue system
- RealtimeTTS: Python library for ultra-low-latency streaming TTS
- moonshine (Useful Sensors, 2024): On-device STT, faster than Whisper tiny
- whisper.cpp: C++ Whisper, runs on CPU, iOS, Android, Raspberry Pi
11.6 Key Research Directions (2025+)
- Speech LLMs: End-to-end spoken dialogue models (like GPT-4o audio)
- Zero-shot multilingual TTS: One model, any language, any voice
- Codec-based unified models: Everything tokenized as audio codes
- On-device streaming: Sub-100ms full-stack STT+TTS on mobile
- Emotional speech: Expressive control beyond speed/pitch
- Personalization: Continuous adaptation from user speech
- Anti-spoofing: Detecting deepfake audio (ADD challenge)
12. RESOURCES, DATASETS & REFERENCES
12.1 Key Research Papers (Read in Order)
STT Papers
- "A tutorial on hidden Markov models" (Rabiner, 1989)
- "Deep Speech: Scaling up end-to-end speech recognition" (Baidu, 2014)
- "Connectionist Temporal Classification" (Graves et al., 2006)
- "Attention-Based Models for Speech Recognition" (Chorowski, 2015)
- "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech" (Meta, 2020)
- "HuBERT: Self-Supervised Speech Representation Learning" (Meta, 2021)
- "Conformer: Convolution-augmented Transformer for SR" (Google, 2020)
- "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper, OpenAI, 2022)
- "Distil-Whisper: Robust Knowledge Distillation" (Hugging Face, 2023)
TTS Papers
- "WaveNet: A Generative Model for Raw Audio" (DeepMind, 2016)
- "Tacotron: Towards End-to-End Speech Synthesis" (Google, 2017)
- "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2, 2018)
- "FastSpeech: Fast, Robust and Controllable TTS" (Microsoft, 2019)
- "FastSpeech 2: Fast and High-Quality E2E TTS" (Microsoft, 2020)
- "HiFi-GAN: Generative Adversarial Networks for Audio Synthesis" (2020)
- "VITS: Conditional Variational Autoencoder with Adversarial Learning for E2E TTS" (2021)
- "VALL-E: Neural Codec Language Models are Zero-Shot TTS" (Microsoft, 2023)
- "Voicebox: Text-Guided Multilingual Universal Speech Generation" (Meta, 2023)
- "NaturalSpeech 3: Zero-Shot Copier-Free Voice Cloning" (Microsoft, 2024)
- "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech" (2024)
12.2 Datasets
ASR Datasets
- LibriSpeech – 960h English, clean+noisy (openslr.org/12)
- CommonVoice 17 – Mozilla, 100+ languages (commonvoice.mozilla.org)
- VoxPopuli – 1791h EU parliament (github.com/facebookresearch/voxpopuli)
- GigaSpeech – 10,000h English diverse (github.com/SpeechColab/GigaSpeech)
- AISHELL-1/2 – Mandarin Chinese (openslr.org/33)
- MCV Corpus – Hindi, Marathi, Tamil, Telugu + many Indian languages
- MUSAN – Noise, music, speech for augmentation
- RIR_NOISES – Room impulse responses
- FLEURS – Google, 100 languages, 12h each
TTS Datasets
- LJ Speech – 24h single speaker, high quality (keithito.com/LJ-Speech-Dataset)
- VCTK – 109 speakers, English (datashare.ed.ac.uk)
- LibriTTS – 585h multi-speaker, clean (openslr.org/60)
- HiFi-TTS – 291h high quality multi-speaker
- AISHELL-3 – 85h Mandarin multi-speaker
- CSS10 – 10 languages, single speaker each
- Kokoro dataset – High-quality curated English
- ESD – Emotional speech dataset (5 emotions, 10 speakers)
12.3 Pre-trained Models to Start With
STT
- openai/whisper-large-v3 (HuggingFace)
- nvidia/parakeet-tdt-1.1b (HuggingFace)
- facebook/wav2vec2-large-960h-lv60 (HuggingFace)
- speechbrain/asr-conformer-... (SpeechBrain Hub)
TTS
- tts_models/en/ljspeech/vits (Coqui TTS)
- tts_models/multilingual/multi-dataset/xtts_v2 (Coqui)
- hexgrad/Kokoro-82M (HuggingFace)
- facebook/mms-tts-eng (HuggingFace)
- suno/bark (HuggingFace)
Speaker
- speechbrain/spkrec-ecapa-voxceleb (Speaker verification)
- pyannote/speaker-diarization-3.1 (Diarization)
12.4 Courses & Learning Resources
COURSES
- Stanford CS224S: Spoken Language Processing (free online)
- Fast.ai Practical Deep Learning (free)
- DeepLearning.AI Sequence Models (Coursera)
- CMU 11-751 Speech Recognition (lecture slides free)
- Hugging Face Audio Course (huggingface.co/learn/audio-course – free)
BOOKS
- "Speech and Language Processing" β Jurafsky & Martin (free PDF: web.stanford.edu/~jurafsky/slp3)
- "Fundamentals of Speech Recognition" β Rabiner & Juang
- "Deep Learning" β Goodfellow, Bengio, Courville (free: deeplearningbook.org)
- "Neural Network Methods for NLP" β Goldberg
KEY BLOGS & RESOURCES
- Lilian Weng's Blog (lilianweng.github.io) – Excellent deep dives
- Papers With Code (paperswithcode.com/task/speech-synthesis)
- Hugging Face Blog
- NVIDIA Developer Blog
- Distill.pub – Visual explanations
COMMUNITIES
- r/MachineLearning (Reddit)
- Hugging Face Discord
- ESPnet GitHub Discussions
- SpeechBrain Slack
- ML Discord servers
12.5 Benchmarks & Evaluation Tools
STT Benchmarks
- LibriSpeech test-clean: Target WER < 2.5% (SOTA ~1.4%)
- LibriSpeech test-other: Target WER < 5% (SOTA ~2.7%)
- CommonVoice: Multilingual WER
- Earnings21: Real-world earnings call transcription
- CHiME-6: Noisy far-field challenge
- NOIZEUS: Noise robustness
TTS Evaluation
- MOS (Mean Opinion Score): Human evaluation 1-5 scale
- UTMOS: Automatic MOS predictor (neural)
- DNSMOS P.835: Noise/speech quality
- SpeechBERTScore: Semantic similarity
- PESQ: Perceptual speech quality
- STOI: Short-Time Objective Intelligibility
- F0 RMSE: Pitch prediction accuracy
- MCD (Mel Cepstral Distortion): Acoustic similarity
Tools for Evaluation
# WER calculation
pip install jiwer
from jiwer import wer, cer
# MOS prediction
pip install speechmos
from speechmos import dnsmos
# PESQ and STOI
pip install pesq pystoi
# Forced alignment (for TTS duration evaluation)
pip install montreal-forced-aligner
12.6 QUICK START CHECKLIST
Week 1: Environment Setup
- Install Python 3.10+, CUDA, PyTorch with GPU support
- Install librosa, torchaudio, transformers, TTS (Coqui)
- Download and run Whisper on a test file
- Run Coqui TTS on a test sentence
- Plot a mel spectrogram from scratch
Week 2–4: Foundations
- Implement MFCC from scratch (no librosa)
- Implement simple HMM for digit recognition
- Fine-tune Whisper base on a custom 1-hour dataset
- Run VITS inference on LJ Speech
Month 2–3: First Models
- Train FastSpeech 2 on LJ Speech (2–3 days on RTX 3090)
- Train HiFi-GAN vocoder
- Build a REST API for STT and TTS
- Build a simple web UI for your service
Month 4–6: Production Service
- Containerize with Docker
- Add authentication, rate limiting
- Add monitoring (Prometheus + Grafana)
- Deploy to cloud (AWS/GCP/Lambda Labs)
- Achieve <300ms TTS latency
Month 7–12: Specialization
- Choose: voice cloning, multilingual, on-device, or real-time streaming
- Train a model from scratch on custom data
- Publish an open-source project or demo
- Read 5+ papers from the cutting-edge list